The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., decrease as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error (from an ontology of 7 types) is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find that BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) BUMP enables measuring the consistency of metrics and reveals that the most discriminative metrics tend not to be the most consistent, and 3) BUMP enables measuring metrics' performance on individual error types and highlights areas of weakness for future work.
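A minimal sketch of the consistency criterion described above: given a scoring function and a set of (document, faithful summary, unfaithful summary) triples, count how often the score drops when the error is introduced. The `metric` callable and the triple format are illustrative assumptions, not part of the BUMP release.

```python
# Minimal sketch: measuring metric consistency on minimal pairs.
# `metric(document, summary)` is a placeholder for any faithfulness scorer
# that assigns higher scores to more faithful summaries (an assumption).
from typing import Callable, Iterable, Tuple

def consistency_rate(
    metric: Callable[[str, str], float],
    pairs: Iterable[Tuple[str, str, str]],  # (document, faithful, unfaithful)
) -> float:
    """Fraction of minimal pairs whose score decreases after the error is introduced."""
    pairs = list(pairs)
    consistent = sum(1 for doc, good, bad in pairs if metric(doc, bad) < metric(doc, good))
    return consistent / len(pairs)
```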
Recent studies have shown that CLIP achieves remarkable success in zero-shot inference, while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact on fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning and motivate us to rethink recently proposed improvements based on CLIP. We will release our code publicly at \url{https://github.com/LightDXY/FT-CLIP}.
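To make the hyper-parameter dependence concrete, here is an illustrative fine-tuning configuration sketch; the knob names are the usual ones for ViT-style fine-tuning, and every value below is an assumption for illustration rather than the recipe reported in the paper.

```python
# Illustrative sketch only: the kinds of hyper-parameters whose choice the
# paper reports as decisive for CLIP fine-tuning. The concrete values are
# assumptions for illustration, not the authors' released recipe.
from dataclasses import dataclass

@dataclass
class FineTuneConfig:
    model: str = "CLIP-ViT-Base/16"
    epochs: int = 100
    batch_size: int = 1024
    optimizer: str = "AdamW"
    base_lr: float = 5e-4               # peak learning rate after warmup
    weight_decay: float = 0.05
    layer_wise_lr_decay: float = 0.65   # smaller lr for earlier transformer blocks
    warmup_epochs: int = 5
    label_smoothing: float = 0.1
    drop_path: float = 0.1              # stochastic depth

config = FineTuneConfig()
```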
This paper presents a 3D generative model that uses diffusion models to automatically generate 3D digital avatars represented as neural radiance fields. A significant challenge in generating such avatars is that the memory and processing costs in 3D are prohibitive for producing the rich details required for high-quality avatars. To tackle this problem, we propose the roll-out diffusion network (Rodin), which represents a neural radiance field as multiple 2D feature maps and rolls out these maps into a single 2D feature plane within which we perform 3D-aware diffusion. The Rodin model brings much-needed computational efficiency while preserving the integrity of diffusion in 3D by using 3D-aware convolution that attends to projected features in the 2D feature plane according to their original relationship in 3D. We also use latent conditioning to orchestrate the feature generation for global coherence, leading to high-fidelity avatars and enabling their semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing generative techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair such as beards. We also demonstrate 3D avatar generation from image or text, as well as text-guided editability.
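A minimal sketch of the roll-out idea under the assumption that the radiance field is stored as a tri-plane of 2D feature maps laid out side by side so a 2D diffusion model can process them jointly; the shapes and layout are illustrative choices, not the released Rodin code.

```python
# Sketch: roll multiple 2D feature maps (assumed tri-plane) into one 2D plane
# for 2D diffusion, and roll the result back. Shapes are illustrative.
import torch

def roll_out(planes: torch.Tensor) -> torch.Tensor:
    """planes: (B, 3, C, H, W) tri-plane features -> (B, C, H, 3*W) single plane."""
    b, p, c, h, w = planes.shape
    return planes.permute(0, 2, 3, 1, 4).reshape(b, c, h, p * w)

def roll_in(plane: torch.Tensor, num_planes: int = 3) -> torch.Tensor:
    """Inverse of roll_out: (B, C, H, 3*W) -> (B, 3, C, H, W)."""
    b, c, h, pw = plane.shape
    w = pw // num_planes
    return plane.reshape(b, c, h, num_planes, w).permute(0, 3, 1, 2, 4)
```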
Estimating the structure of directed acyclic graphs (DAGs) over features (variables) plays a vital role in revealing the latent data generation process and providing causal insights in various applications. Although there have been many studies on structure learning for various types of data, structure learning on dynamic graphs has not yet been explored, so we study the problem of learning the node feature generation mechanism on such ubiquitous dynamic graph data. In a dynamic graph, we propose to simultaneously estimate contemporaneous relationships and time-lagged interaction relationships between node features. These two kinds of relationships form a DAG, which effectively characterizes the feature generation process in a concise way. To learn such a DAG, we cast the learning problem as a continuous score-based optimization problem, which consists of a differentiable score function to measure the validity of the learned DAGs and a smooth acyclicity constraint to ensure the acyclicity of the learned DAGs. These two components are translated into an unconstrained augmented Lagrangian objective, which can be minimized with mature continuous optimization techniques. The resulting algorithm, named GraphNOTEARS, outperforms baselines on simulated data across a wide range of settings that may be encountered in real-world applications. We also apply the proposed approach to two dynamic graphs constructed from the real-world Yelp dataset, demonstrating that our method can learn connections between node features that conform with domain knowledge.
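For reference, the standard smooth acyclicity constraint from NOTEARS, h(W) = tr(e^{W∘W}) − d, is zero exactly when the weighted adjacency matrix W is acyclic; whether GraphNOTEARS uses this exact form is an assumption based on its name. A small sketch:

```python
# Sketch of the NOTEARS-style smooth acyclicity constraint used in
# score-based DAG learning: h(W) = tr(exp(W ∘ W)) - d.
import numpy as np
from scipy.linalg import expm

def acyclicity(W: np.ndarray) -> float:
    """Returns 0 when W is the weighted adjacency matrix of a DAG, > 0 otherwise."""
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)
```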
Despite their tantalizing success on a broad range of vision tasks, transformers have not yet demonstrated capability on par with ConvNets in high-resolution image generative modeling. In this paper, we seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis. To this end, we believe that local attention is crucial for striking a balance between computational efficiency and modeling capacity. Hence, the proposed generator adopts the Swin transformer in a style-based architecture. To achieve a larger receptive field, we propose double attention, which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Moreover, we show that providing the knowledge of absolute position, which is lost in window-based transformers, greatly benefits generation. The proposed StyleSwin is scalable to high resolutions, with both coarse geometry and fine structures benefiting from the strong expressivity of transformers. However, blocking artifacts occur during high-resolution synthesis, because performing local attention in a block-wise manner may break spatial coherency. To solve this, we empirically investigate various solutions, among which we find that employing a wavelet discriminator to examine the spectral discrepancy effectively suppresses the artifacts. Extensive experiments show superiority over prior transformer-based GANs, especially at high resolutions, e.g., 1024x1024. Without complex training strategies, StyleSwin surpasses StyleGAN on CelebA-HQ 1024 and achieves on-par performance on FFHQ-1024, demonstrating the promise of using transformers for high-resolution image generation. Code and models will be available at https://github.com/microsoft/styleswin.
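A simplified sketch of the double-attention idea, assuming half of the channels attend within regular local windows and half within shifted windows so each layer sees a larger receptive field; the window size, head split, and module layout are illustrative assumptions, not the released StyleSwin code.

```python
# Sketch: split channels across two window-attention branches, one on regular
# windows and one on shifted windows, then concatenate the results.
import torch
import torch.nn as nn

class DoubleWindowAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, window: int = 8):
        super().__init__()
        self.window = window
        half = dim // 2
        self.attn_regular = nn.MultiheadAttention(half, num_heads // 2, batch_first=True)
        self.attn_shifted = nn.MultiheadAttention(half, num_heads // 2, batch_first=True)

    def _windowed(self, attn, x, shift):
        b, h, w, c = x.shape
        if shift:
            x = torch.roll(x, shifts=(-self.window // 2, -self.window // 2), dims=(1, 2))
        # Partition into non-overlapping windows and attend within each window.
        win = x.reshape(b, h // self.window, self.window, w // self.window, self.window, c)
        win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, self.window * self.window, c)
        out, _ = attn(win, win, win)
        out = out.reshape(b, h // self.window, w // self.window, self.window, self.window, c)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)
        if shift:
            out = torch.roll(out, shifts=(self.window // 2, self.window // 2), dims=(1, 2))
        return out

    def forward(self, x):  # x: (B, H, W, C), with H and W divisible by the window size
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat(
            [self._windowed(self.attn_regular, x1, shift=False),
             self._windowed(self.attn_shifted, x2, shift=True)],
            dim=-1)
```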
Video Panoptic Segmentation (VPS) aims to assign a class label to each pixel and to uniquely segment and identify all object instances across all frames. Classic solutions usually decompose the VPS task into multiple sub-tasks and use multiple surrogates (e.g., boxes and masks, centers and offsets) to represent objects. However, this divide-and-conquer strategy requires complex post-processing in both the spatial and temporal domains and is prone to failures in the surrogate tasks. In this paper, inspired by object-centric learning, which learns compact and robust object representations, we present Slot-VPS, the first end-to-end framework for this task. We encode all panoptic entities in a video, including foreground instances and background semantics, with a unified representation called panoptic slots. The coherent spatio-temporal object information is retrieved and encoded into the panoptic slots by the proposed Video Panoptic Retriever, enabling it to localize, segment, differentiate, and associate objects in a unified manner. Finally, the output panoptic slots can be directly converted into the classes, masks, and object IDs of the panoptic objects in the video. We conduct extensive ablation studies and demonstrate the effectiveness of our approach on two benchmark datasets, Cityscapes-VPS (val and test sets) and VIPER (val set), achieving new state-of-the-art performance of 63.7, 63.3, and 56.2 VPQ, respectively.
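A minimal sketch of how unified panoptic slots could be decoded into per-object classes and masks, assuming a linear classification head and a dot product between slot vectors and per-pixel embeddings; these design choices are illustrative, not the Slot-VPS implementation.

```python
# Sketch: decode slot vectors into class logits and soft masks.
import torch
import torch.nn as nn

class SlotDecoder(nn.Module):
    def __init__(self, slot_dim: int = 256, num_classes: int = 19):
        super().__init__()
        self.cls_head = nn.Linear(slot_dim, num_classes + 1)  # +1 for "no object"

    def forward(self, slots, pixel_feats):
        # slots: (B, N, D); pixel_feats: (B, D, H, W)
        class_logits = self.cls_head(slots)                         # (B, N, K+1)
        masks = torch.einsum("bnd,bdhw->bnhw", slots, pixel_feats)  # (B, N, H, W)
        return class_logits, masks
```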
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. The method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space approach is well suited for text-to-image generation tasks, because it not only eliminates the unidirectional bias of existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, a serious problem with existing methods. Our experiments show that VQ-Diffusion produces significantly better text-to-image generation results than conventional autoregressive (AR) models with a similar number of parameters. Compared with previous GAN-based text-to-image methods, VQ-Diffusion can handle more complex scenes and improves the quality of the synthesized images by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient through reparameterization. With traditional AR methods, text-to-image generation time increases linearly with the output image resolution and is therefore quite time consuming even for normal-size images. VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments show that the VQ-Diffusion model with reparameterization is fifteen times faster than traditional AR methods while achieving better image quality.
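A minimal sketch of a mask-and-replace forward corruption step on discrete VQ token indices: each token is kept, replaced by a random codebook entry, or set to a special [MASK] token. The probabilities below are illustrative assumptions, not the paper's noise schedule.

```python
# Sketch: one mask-and-replace corruption step over discrete VQ token indices.
import torch

def mask_and_replace(tokens, vocab_size, mask_id, p_mask=0.3, p_replace=0.1):
    """tokens: (B, L) integer VQ indices -> corrupted copy of the same shape."""
    u = torch.rand_like(tokens, dtype=torch.float)
    corrupted = tokens.clone()
    # Replace a fraction of tokens with random codebook entries.
    replace = u < p_replace
    corrupted[replace] = torch.randint(vocab_size, corrupted[replace].shape, device=tokens.device)
    # Mask a further fraction with the special [MASK] token.
    corrupted[(u >= p_replace) & (u < p_replace + p_mask)] = mask_id
    return corrupted
```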
The research community has increasing interest in autonomous driving research, despite the resource intensity of obtaining representative real-world data. Existing self-driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is crucial to the overall viability of the technology. In an effort to help align the research community's contributions with real-world self-driving problems, we introduce a new large-scale, high-quality, diverse dataset. Our new dataset consists of 1150 scenes that each span 20 seconds, consisting of well-synchronized and calibrated high-quality LiDAR and camera data captured across a range of urban and suburban geographies. It is 15x more diverse than the largest camera+LiDAR dataset available, based on our proposed geographical coverage metric. We exhaustively annotated this data with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames. Finally, we provide strong baselines for 2D as well as 3D detection and tracking tasks. We further study the effects of dataset size and generalization across geographies on 3D detection methods. Find data, code, and more up-to-date information at http://www.waymo.com/open.
Modern deep networks can be better generalized when trained with noisy samples and regularization techniques. Mixup and CutMix have been proven to be effective for data augmentation to help avoid overfitting. Previous Mixup-based methods linearly combine images and labels to generate additional training data. However, this is problematic if the object does not occupy the whole image, as we demonstrate in Figure 1. Correctly assigning the label weights is hard even for human beings, and there is no clear criterion to measure it. To tackle this problem, in this paper we propose LUMix, which models such uncertainty by adding label perturbation during training. LUMix is simple, as it can be implemented in just a few lines of code, and can be universally applied to any deep network, \eg CNNs and Vision Transformers, with minimal computational cost. Extensive experiments show that LUMix can consistently boost performance for networks with a wide range of diversity and capacity on ImageNet, \eg $+0.7\%$ for the small model DeiT-S and $+0.6\%$ for the large variant XCiT-L. We also demonstrate that LUMix leads to better robustness when evaluated on ImageNet-O and ImageNet-A. The source code can be found \href{https://github.com/kevin-ssy/LUMix}{here}.
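A minimal sketch of label perturbation in a Mixup/CutMix training step, where the mixing weight used for the targets is jittered with random noise to model uncertainty about how much of each object is visible in the mixed image; the perturbation form is an assumption for illustration, not the released LUMix implementation.

```python
# Sketch: CutMix-style loss with a noise-perturbed label mixing weight.
import torch
import torch.nn.functional as F

def mixed_loss_with_label_perturbation(logits, y_a, y_b, lam, noise_std=0.1):
    """logits: (B, K); y_a, y_b: (B,) integer labels; lam: CutMix area ratio."""
    noise = noise_std * torch.randn(()).item()
    lam_noisy = min(max(lam + noise, 0.0), 1.0)   # keep the weight in [0, 1]
    return (lam_noisy * F.cross_entropy(logits, y_a)
            + (1.0 - lam_noisy) * F.cross_entropy(logits, y_b))
```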
In this paper, we study the effects of incorporating timestamps, such as document creation dates, into generation systems. Two types of time-aware prompts are investigated: (1) textual prompts that encode document timestamps in natural language sentences; and (2) linear prompts that convert timestamps into continuous vectors. To explore extrapolation to future data points, we further introduce a new data-to-text generation dataset, TempWikiBio, containing more than 4 million chronologically ordered revisions of biographical articles from English Wikipedia, each paired with a structured personal profile. Through data-to-text generation on TempWikiBio, text-to-text generation on the content transfer dataset, and summarization on XSum, we show that linear prompts on the encoder side and textual prompts improve generation quality on all datasets. Despite suffering a smaller performance drop when tested on data drawn from a later time, linear prompts focus more on non-temporal information and are less sensitive to the given timestamps, according to human evaluations and sensitivity analyses. Meanwhile, textual prompts establish an association between the given timestamps and the output dates, yielding more factual temporal information in the output.
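A minimal sketch of the two prompt types, assuming a natural-language prefix for the textual prompt and a learned linear projection of a normalized timestamp for the linear prompt; the exact wording and projection are illustrative assumptions, not the paper's formulations.

```python
# Sketch: a textual timestamp prompt and a continuous (linear) timestamp prompt.
import torch
import torch.nn as nn

def textual_time_prompt(document: str, date: str) -> str:
    """Encode the document creation date as a natural-language prefix."""
    return f"This document was written on {date}. {document}"

class LinearTimePrompt(nn.Module):
    """Map a normalized timestamp to a continuous prompt vector prepended to the encoder input."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(1, hidden_dim)

    def forward(self, token_embeddings: torch.Tensor, timestamp: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (B, L, D); timestamp: (B, 1) scaled to [0, 1]
        prompt = self.proj(timestamp).unsqueeze(1)         # (B, 1, D)
        return torch.cat([prompt, token_embeddings], dim=1)
```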